1. Overview of Phenotypic Data

Distributions of density fold change and lifespan

Most of the phenotypes, except for mean replicative lifespan, deviate from a normal distribution.

Note that 17 strains are missing respiration or fermentation density as well as mean and maximal replicative lifespan.

Pairwise correlation between different phenotypes

Pairs that might be correlated

Respiration Density FC vs Fermentation Density FC

Spearman’s rank correlation rho=0.786827, p-value < 2.2e-16

CLS vs Mean RLS

Spearman’s rank correlation rho=0.121463, p-value = 0.1374. No significant correlation between CLS and Mean RLS.

Mean RLS vs Max RLS

Spearman’s rank correlation rho=0.6762903, p-value < 2.2e-16.

Respiration Density vs. CLS

Spearman’s rank correlation rho=0.2764692, p-value = 0.0005897. Significant but low level of correlation.

Note that strains that have very low respiration density can have a huge range of CLS.

Fermentation Density vs. MeanRLS

Spearman’s rank correlation rho=0.2556684, p-value = 0.001533. Significant but low level of correlation.

2. Overview of Omics Data

Normalization

Given that all data points are log2(fold change), I would refrain from doing any normalization that we would do on raw data.

Missingness

There is a non-negligible amount of missing values in this dataset. We would like to understand the mechanism of missingness in order to choose an appropriate method for handling the missing values. Specifically, we would like to know whether values are missing at random or not at random. In the latter case, we would expect a higher rate of missing values when the abundance of a chemical species is close to the limit of detection.

To identify the missingness mechanism in the data, we plot the number of missing values among the 168 strains that have phenotypic data against the average log2(fold change) value. Each dot represents a molecule. The red line marks 10% missingness. The figures show that the number of missing values is not correlated with signal intensity, which suggests that values are missing at random. Removing molecules with too many missing values and then performing k-nearest neighbors (KNN) imputation is therefore a reasonable way to deal with missing values.
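The quantities behind this plot can be sketched as follows. This is a minimal illustration in Python, not the code used for the analysis (the saved RData files suggest the analysis itself was run in R). The function name `missingness_summary`, the use of `None` for missing values, and the toy data are all assumptions made for the example.

```python
import math


def missingness_summary(table):
    """Return {molecule: (number of missing values, mean of observed values)}.

    `table` maps a molecule name to its log2(fold change) values across
    strains, with None marking a missing value (an assumed encoding).
    """
    summary = {}
    for molecule, values in table.items():
        observed = [v for v in values if v is not None]
        n_missing = len(values) - len(observed)
        mean_observed = sum(observed) / len(observed) if observed else math.nan
        summary[molecule] = (n_missing, mean_observed)
    return summary


# Toy data: three hypothetical molecules measured in four strains.
toy = {
    "mol_A": [0.5, None, 1.5, 1.0],
    "mol_B": [-2.0, -1.0, None, None],
    "mol_C": [0.0, 0.2, -0.2, 0.4],
}
summary = missingness_summary(toy)
for name, (n_missing, mean_observed) in summary.items():
    print(name, n_missing, round(mean_observed, 3))
```

Plotting the first element of each pair against the second gives one dot per molecule, as in the figures described above.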

Imputation

I kept molecules with no more than 16 missing values among the 168 strains that have phenotypic data (<10% missingness) and performed KNN imputation to fill in the remaining missing values. Allowing more missingness would not add many molecules and would leave too few complete cases for computing neighbors in KNN.
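The filter-then-impute step can be sketched as below. This is a simplified stand-in written in Python, not the imputation routine actually used. It assumes that neighbors are molecules (rather than strains), that distance is the mean squared difference over jointly observed strains, and that `None` marks a missing value. All function names and the toy data are hypothetical.

```python
def filter_by_missingness(table, max_missing):
    """Keep molecules (rows) with at most max_missing missing values (None)."""
    return {name: row for name, row in table.items()
            if sum(v is None for v in row) <= max_missing}


def row_distance(a, b):
    """Mean squared difference over strains observed in both rows."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return float("inf")
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)


def knn_impute(table, k=2):
    """Fill each missing value with the mean of the k nearest molecules
    that have an observed value for the same strain."""
    imputed = {}
    for name, row in table.items():
        # Other molecules sorted from nearest to farthest.
        neighbors = sorted(
            (row_distance(row, other), other_name)
            for other_name, other in table.items()
            if other_name != name
        )
        new_row = list(row)
        for j, value in enumerate(row):
            if value is None:
                donors = [table[n][j] for d, n in neighbors
                          if d != float("inf") and table[n][j] is not None][:k]
                if donors:  # leave the gap if no neighbor covers this strain
                    new_row[j] = sum(donors) / len(donors)
        imputed[name] = new_row
    return imputed


# Toy data: five hypothetical molecules measured in four strains.
toy = {
    "mol_A": [1.0, 2.0, None, 4.0],
    "mol_B": [1.0, 2.0, 3.0, 4.0],
    "mol_C": [1.2, 2.2, 3.4, 4.2],
    "mol_D": [-3.0, None, None, None],
    "mol_E": [-1.0, -2.0, -3.0, -4.0],
}
kept = filter_by_missingness(toy, max_missing=1)  # drops mol_D
filled = knn_impute(kept, k=2)
print(filled["mol_A"])
```

With the real data, the filter would be called with `max_missing=16` to match the threshold described above.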

To make sure that imputation does not drastically alter the data, I performed PCA on the molecules with no missing data and PCA on the imputed data. The left panel shows PC1 vs PC2 with complete data, and the right panel shows PC1 vs PC2 with imputed data. Note that the sign of any PC is arbitrary.

Imputation does not seem to change the data drastically. Subsequent analyses that require complete data are done on the imputed data.

3. Exploratory Analysis Combining Phenotypes and Omics

The following PCA plots are based on imputed data. Dots are colored by phenotypic measures. Grey dots are strains with missing phenotypic data.

Respiration Protein colored by Chronological Lifespan

Respiration Metabolite colored by Chronological Lifespan

Respiration Lipid colored by Chronological Lifespan

There may be some weak correlation between chronological lifespan and PC1 of respiration omics.

Fermentation Protein colored by Mean Replicative Lifespan

Fermentation Metabolite colored by Mean Replicative Lifespan

Fermentation Lipid colored by Mean Replicative Lifespan

It is difficult to see much correlation between mean replicative lifespan and the first three PCs of fermentation omics. This does not rule out individual molecules that are correlated with replicative lifespan.

Respiration Protein colored by Respiration Density Fold Change

Respiration Metabolite colored by Respiration Density Fold Change

Respiration Lipid colored by Respiration Density Fold Change

PC1 is highly correlated with respiration density.

Fermentation Protein colored by Fermentation Density Fold Change

Fermentation Metabolite colored by Fermentation Density Fold Change

Fermentation Lipid colored by Fermentation Density Fold Change

The first three PCs are somewhat correlated with fermentation density.

4. Univariate Analysis

We are interested in identifying molecules whose levels are correlated with phenotypes.

Considering that the distributions of most phenotypes, as well as of most molecules, are not normal, we use Spearman’s rank-order correlation, which is a nonparametric version of the Pearson product-moment correlation. Spearman’s correlation coefficient rho measures the strength and direction of the monotonic association between two variables.
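To make the relationship to Pearson’s correlation concrete, rho can be computed by ranking both variables (ties receive the average rank) and then taking the Pearson correlation of the ranks. The sketch below is a plain-Python illustration with hypothetical function names and toy data. It does not compute a p-value and is not the code used to produce the results in this report.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: a monotonic but strongly non-linear relationship.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]
print(spearman_rho(x, y), pearson(x, y))
```

Because the toy relationship is perfectly monotonic, rho reaches 1 even though the Pearson correlation on the raw values is lower, which is why a rank-based measure suits skewed data.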

I performed Spearman’s correlation between respiration omics and chronological lifespan, respiration omics and respiration density fold change, fermentation omics and replicative lifespan, and fermentation omics and fermentation density fold change. Complete results are stored in “SpearmanCor.RData”. Lists of molecules that are significantly correlated with phenotypes are stored in “SpearmanCor_ForEnrichment.xlsx”. You are welcome to filter the data at a different stringency by changing the FDR or Bonferroni thresholds.
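For readers who want to re-filter the results, the two corrections can be sketched as follows. The report does not state which FDR procedure was used, so the Benjamini-Hochberg procedure here is an assumption, and the function names and p-values are made up for illustration.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: p * m, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five molecules.
pvals = [0.001, 0.012, 0.02, 0.041, 0.6]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
hits_bonf = [i for i, p in enumerate(bonf) if p < 0.05]
hits_bh = [i for i, q in enumerate(bh) if q < 0.05]
print(hits_bonf, hits_bh)
```

In this toy example the FDR filter keeps more molecules than the Bonferroni filter at the same 0.05 threshold, which reflects the usual trade-off between the two levels of stringency.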

For proteins, it would make sense to perform GO enrichment analysis with the significant molecules on the Saccharomyces Genome Database www.yeastgenome.org (“Analyze”-“GO Term Finder”). Note that it is necessary to specify a background set of genes. In this case the background set is the list of proteins that went through the Spearman correlation test, which is a subset of all identified proteins. For metabolites, since there is a large number of unidentified molecules for which only the m/z is known, it is difficult to perform enrichment analysis. For lipids, since we have limited a priori knowledge of lipid function, we cannot perform enrichment analysis. Another thing to note is that many molecules are correlated with density fold change, and it would not make sense to perform GO enrichment if most of the tested molecules are significant.

5. Multivariate Analysis

The goal is to use supervised learning methods to predict phenotypes from omics data, and to identify molecules that are critical in predicting phenotypes. I selected three commonly used regression models: partial least squares, random forest and elastic net.

Partial least squares (PLS) is most commonly used in the field of chemometrics. The PLS approach attempts to find rotations of features that explain both the response and the predictors, and it accommodates correlated predictors. The number of partial least squares directions (ncomp) used in PLS is a tuning parameter that is chosen by 5-fold cross-validation.
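The way a tuning parameter is chosen by 5-fold cross-validation can be sketched generically. The example below is an assumption-laden illustration: it swaps in a one-predictor ridge regression as a stand-in model (it is not PLS), uses deterministic folds instead of a random shuffle, and runs on made-up data. The selection logic (fit on four folds, score on the fifth, average the RMSE, keep the best candidate) is the part that carries over.

```python
def k_fold_indices(n, k=5):
    """Deterministic folds: sample i goes to fold i % k.
    A real analysis would normally shuffle the samples first."""
    return [[i for i in range(n) if i % k == f] for f in range(k)]


def fit_ridge_1d(x, y, lam):
    """Stand-in model: no-intercept ridge slope for a single predictor."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def cv_rmse(x, y, lam, k=5):
    """Mean held-out RMSE across k folds for one candidate value of lam."""
    fold_rmse = []
    for test_idx in k_fold_indices(len(x), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        slope = fit_ridge_1d([x[i] for i in train_idx],
                             [y[i] for i in train_idx], lam)
        sq_err = [(y[i] - slope * x[i]) ** 2 for i in test_idx]
        fold_rmse.append((sum(sq_err) / len(sq_err)) ** 0.5)
    return sum(fold_rmse) / len(fold_rmse)


# Toy data: y is roughly 2 * x with a small alternating perturbation.
x = [float(i) for i in range(1, 11)]
y = [2.0 * v + (0.1 if i % 2 == 0 else -0.1) for i, v in enumerate(x)]

candidates = [0.0, 1.0, 10.0, 100.0]
scores = {lam: cv_rmse(x, y, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)
print(best_lam, scores)
```

The same loop applies to ncomp, mtry, or the elastic net parameters by replacing the stand-in model and the candidate grid.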

Random forests is a non-linear ensemble learning method. A random forest for regression constructs a multitude of decision trees from subsets of predictors and outputs the mean prediction of the individual trees. It accommodates data with correlated predictors, and can be used to rank the importance of predictors. The most important tuning parameter is the number of variables considered for splitting at each node (mtry), which is chosen by 5-fold cross-validation.

Elastic net is a regularized regression method that linearly combines the lasso and ridge penalties. The first of the two important tuning parameters is alpha: when alpha=1 the model uses the lasso penalty, when alpha=0 the ridge penalty, and when alpha is between 0 and 1 the penalty is a mixture of the two. The other important tuning parameter is lambda, which controls the size of the penalty. Both alpha and lambda are chosen by 5-fold cross-validation.
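The role of alpha and lambda can be illustrated by writing out the penalty term. The sketch below assumes the commonly used parameterization lambda * (alpha * L1 + (1 - alpha) / 2 * L2). The factor of 1/2 on the ridge term is an assumption about the convention, since the report does not state the exact form, and the function name and coefficients are hypothetical.

```python
def elastic_net_penalty(coefs, alpha, lam):
    """Penalty added to the regression loss for a coefficient vector.

    alpha=1 gives a pure lasso (L1) penalty, alpha=0 a pure ridge (L2)
    penalty, and lam scales the overall size of the penalty.
    """
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)


coefs = [1.0, -2.0, 0.0]  # hypothetical regression coefficients
lasso = elastic_net_penalty(coefs, alpha=1.0, lam=0.5)
ridge = elastic_net_penalty(coefs, alpha=0.0, lam=0.5)
mixed = elastic_net_penalty(coefs, alpha=0.5, lam=0.5)
print(lasso, ridge, mixed)
```

For a fixed lambda, the mixed penalty is the alpha-weighted average of the lasso and ridge penalties.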

For each of these models, I saved the model information, as well as variable importance, in RData files (“PLS.RData”,“RF.RData”,“EN.RData”). For models that we think do a good job predicting lifespan, we could take the n most important variables and perform enrichment analysis, where n is an arbitrary positive integer.

Partial least squares models

The RespProt model has a root mean squared error (RMSE) of 9.93 and an R squared of 0.25. The RespMet model has RMSE of 9.83 and R squared of 0.28. The RespLip model has RMSE of 10.35 and R squared of 0.22. From the plots we see that the PLS model does not do a great job predicting chronological lifespan with respiration omics. The relatively high R squared is mostly driven by the two clouds of high and low lifespans.
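The effect of two clouds on R squared can be reproduced with a toy example. The sketch below assumes R squared is computed as the squared Pearson correlation between observed and predicted values. The report does not say which definition was used, and the numbers are invented purely for illustration.

```python
def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    sq_err = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return (sum(sq_err) / len(sq_err)) ** 0.5


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Invented lifespans: a low cloud (first three) and a high cloud (last three).
observed = [10, 12, 14, 40, 42, 44]
predicted = [14, 12, 10, 44, 42, 40]  # right cloud, wrong order within it

overall_r2 = pearson(observed, predicted) ** 2
within_low = pearson(observed[:3], predicted[:3])
within_high = pearson(observed[3:], predicted[3:])
print(rmse(observed, predicted), overall_r2, within_low, within_high)
```

Here the overall R squared is high only because the predictions separate the two clouds. Within each cloud the predictions are negatively correlated with the observed values, so R squared should be read together with the scatter plots.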

The FermProt model has RMSE of 4.94, and R squared of 0.23. The FermMet model has RMSE of 5.37 and R squared of 0.14. The FermLip model has RMSE of 5.52 and R squared of 0.08. From the plots we see that the PLS model does not do a great job predicting replicative lifespan with fermentation omics.

Random forest models

The RespProt model has RMSE of 10.20, and R squared of 0.23. The RespMet model has RMSE of 10.07 and R squared of 0.25. The RespLip model has RMSE of 10.53 and R squared of 0.17. From the plots we see that the random forest models do a satisfactory job predicting chronological lifespan with respiration omics. It is worth looking into variables with high importance in these models.

The FermProt model has RMSE of 5.07, and R squared of 0.19. The FermMet model has RMSE of 5.43 and R squared of 0.07. The FermLip model has RMSE of 5.41 and R squared of 0.09. From the plots we see that the random forest models do an OK job predicting replicative lifespan with fermentation omics. However we need to be cautious with the models since the R squared values are pretty low.

Elastic net models

The RespProt model has RMSE of 10.18, and R squared of 0.25. The RespMet model has RMSE of 10.14 and R squared of 0.23. The RespLip model has RMSE of 10.36 and R squared of 0.19. From the plots we see that the elastic net models do not do a great job predicting chronological lifespan with respiration omics.

The FermProt model has RMSE of 4.80, and R squared of 0.26. The FermMet model has RMSE of 5.29 and R squared of 0.12. The FermLip model has RMSE of 5.61 and R squared of 0.11. From the plots we see that the elastic net models for protein and metabolite do an OK job predicting replicative lifespan with fermentation omics. Lipid is not useful in predicting replicative lifespan.

Random forest in general does a better job predicting lifespan with omics. The caveat is that the mechanism behind a random forest model is not immediately clear. We can still make sense of the models by examining the variables with high importance, though.

The two linear models, PLS and elastic net, do not do a very good job predicting lifespan, especially chronological lifespan. I think this may have to do with the fact that the distribution of chronological lifespan is not normal (it is almost bimodal).

We know that density is highly correlated with omics. To control for the effects of density, I ran all models again with density fold change as a covariate.

Partial least squares models with density

The RespProt model has RMSE of 10.62, and R squared of 0.28. The RespMet model has RMSE of 10.14 and R squared of 0.29. The RespLip model has RMSE of 10.53 and R squared of 0.19. From the plots we see that there is a slight improvement in prediction after adding density as a covariate.

The FermProt model has RMSE of 4.94, and R squared of 0.23. The FermMet model has RMSE of 5.36 and R squared of 0.14. The FermLip model has RMSE of 5.51 and R squared of 0.09.

Random forest models with density

The RespProt model has RMSE of 10.26, and R squared of 0.27. The RespMet model has RMSE of 10.24 and R squared of 0.29. The RespLip model has RMSE of 10.63 and R squared of 0.21.

The FermProt model has RMSE of 5.07, and R squared of 0.19. The FermMet model has RMSE of 5.48 and R squared of 0.06. The FermLip model has RMSE of 5.42 and R squared of 0.11.

Elastic net models with density

The RespProt model has RMSE of 10.14, and R squared of 0.25. The RespMet model has RMSE of 10.42 and R squared of 0.22. The RespLip model has RMSE of 10.61 and R squared of 0.20.

The FermProt model has RMSE of 5.02, and R squared of 0.21. The FermMet model has RMSE of 5.23 and R squared of 0.15. The FermLip model has RMSE of 5.61 and R squared of 0.08.